.pn1 .po4 A VERY SIMPLE CHART PARSER Peter Ross Dept. of AI, University of Edinburgh, 80 South Bridge, Edinburgh EH1 1HN, Scotland. This describes how to use the (very) simple chart parser written in Prolog in the file "chart.pro". You will need LOTS of stack space - the program does no assertion or retraction at run time, but keeps all the run-time data in lists. The program expects data in the form of Prolog clauses: [1] rule(Tag, Category, ExpansionList). The 'Tag' argument is just an arbitrary Prolog term used to group together a set of rule/3 and lexical/3 clauses. The parser needs to be told this tag in order to know which set of these clauses to use as data. Since it is arbitrary, you could use a compound term to allow you to selectively include rules, at no extra cost. The 'Category' argument identifies a syntactic category, e.g. noun or sentence. The 'ExpansionList' is a list giving one possible expansion of that category into sub-categories. Example: rule(simple, sentence, [noun_phrase, verb_phrase]). The categories (both Category and those in ExpansionList) can be compound terms, thus allowing the use of variables for result passing and context setting. The parser copies terms as needed, when unification might be a bad thing to do because it might cause supposedly independent data structures to become too specific. This copying, by the way, is done by dismantling the source structure and building a duplicate rather than by the crude assert/retract method; the latter tends to be much slower but takes less run-time space. NOTE: categories are not supposed to be defined recursively. Nevertheless, the code does check whether you are adding an edge you already have. The check does slow things a bit, but is specifically a check against left recursion at the time when empty active edges are added. A fuller check against adding duplicate inactive edges has been commented out; you probably don't need it unless you mess with the rule tags at run time so as to cause new parsing rules to be introduced during parsing. [2] lexical(Tag, Word, CategoryList). The 'Tag' argument is as above. The 'Word' is a word that may legitimately appear in the input. The 'CategoryList' is a list of the possible syntactic categories which that word might fit, e.g. 'fly' might be a noun or verb. As above, the categories can be arbitrary terms. [3] initial_category(Tag, Category). This states what the highest level category is - it should be unique. Š [4] strategy(Tag, Strategy). This should instantiate Strategy to either 'td' for top-down or 'bu' for bottom-up. The parser does not check it for validity. [5] policy(Tag, Policy). This should instantiate Policy to either 'df' for depth-first or 'bf' for breadth-first. The parser does not check it for validity. [6] ersatz_category(Tag, DummyCategoryName). By default, the category name 'user' is specially used by the system. It supposes that there is an extra rule of the form rule(Tag, user, [TopLevelCategory]) when doing top-down parsing. This extra category will of course appear in the final chart, and should help in the later analysis of it. If you are really attached to the category name 'user' but want to do some top-down parsing, then you can persuade the system to use a different name instead by including a clause of this form. Although this ersatz category can be useful, its original purpose was to make the program shorter at negligible speed cost by ensuring that the top-level category appeared on the left of a single rule. IMPORTANT: The rule/3 and lexical/3 predicates are only used during the initialisation stage, when the transformed rules are constructed and when the initial chart is created. There are three important data structures manipulated by the program: [a] edge(Category, FoundList, NeedsList, StartVertex, EndVertex). This describes an edge in the chart. If you don't know what this means, read Winograd's book "Language As A Cognitive Process, vol 1" or the articles by Henry Thompson and Graeme Ritchie in "Artificial Intelligence: Tools, Techniques and Applications" by O'Shea and Eisenstadt (published by Harper and Row). There are three odd details. Items on the 'FoundList' are of the form Category = VertexNumber (the starting, left-hand vertex of that instance of the category). A word from the input will appear in the 'FoundList' as word(Word) = VertexNumber The 'FoundList' is ordered so that the most recently found category is first. [b] the chart: ActiveEdgeList + InactiveEdgeList. The edges are segregated into two lists for convenience. The parser returns such a chart when it finishes. [c] the agenda: CandidateList - Hole. This is a difference list (a list with a variable at the end - 'Hole' is the variable). The items in the list are of the form ActiveEdge + InactiveEdge and are candidates (guaranteed successful, in fact) for an application of the fundamental rule of chart parsing. The default monitoring system prints out agendas. The program has two top level predicates. They are: Š [i] parse(Tag, WordList, MaxVertex, Chart). The Tag and WordList must be given. The MaxVertex and Chart must be variables. When the parse is done, MaxVertex will be instantiated to the largest vertex number (the lowest is 0), and Chart will be instantiated to the final chart. When doing top-down parsing, the parser adds one ersatz rule of the form rule(Tag,user,[TopMostCategory]) - there will be edges for the ersatz category 'user' to help you to examine the chart afterwards for successful parsings. You can substitute what you want for 'user' - see above. This predicate does some transformations on the rule/3 clauses supplied by the user. The transformations are all done by the predicate invert_rules(Tag). [ii] make_chart(Tag, WordList, MaxVertex, Chart). This is exactly like parse/4 above, but assumes that the rule transformations have already been done. When a parse has been done and a final chart has been produced, you may want to add to the word list and extend the chart, for example if using the parser for plan recognition. You can do so by the following means: call initial_setup(Tag, Strategy, Policy, NewWordList, MinVertex, NewMaxVertex, OldChart, NewInitialChart, AgendaOfYourChoice, NewAgenda), chart(Tag, Strategy, Policy, NewInitialChart, NewAgenda, FinalChart) This will assume that more words are added between MinVertex and NewMaxVertex, and will add to the OldChart and AgendaOfYourChoice you gave it, to create a new FinalChart. Since all the information needed about the presence of any previous set of words (presumably ending at MinVertex) will be contained in the OldChart you got out of parsing that previous word sequence, this will incrementally add to the chart. Normally parsing stops when the parser runs out of agenda. You may discover a need to suspend parsing under program control, for example if you want to doctor the chart or extract useful information from it before the entire input sequence is available. If so, you may redefine the test stop_parser(Tag,Strategy,Policy,Chart,Agenda,FinalChart) which currently does no more than instantiate the result FinalChart to Chart when Agenda is empty. It is the success of this predicate which halts parsing. The implementation of the fundamental rule is explicit within the code, so that it is easy to change for your own purposes. There is a simple monitoring mechanism. The predicates watch(Tag) nowatch(Tag) turn it on or off for the specified rule set. If on, it reports when the rule transformations are complete, and whenever an item on the agenda is processed, it tries to call the predicate user_mon(Tag,Strategy,Policy,OldChart,OldAgenda,NewChart,NewAgenda) If this fails, it uses a default strategy of printing out the new chart and agenda and waiting for a before continuing. ŠThere is a simple test rig, called by the predicate test(Tag). It makes use of a trivial utility print_chart(Chart) to print out the active and inactive edges after the parse has finished. The lists of edges are sorted before being printed, into the standard order of Prolog terms. A sample rule set can be found in the files "test01.pro" and "test02.pro". IMPLEMENTATION: SOME THOUGHTS This program clearly shows the need for some kind of user-definable indexing in Prolog in order to make such things efficient. Unfortunately record/3 is not much use, because the key can only usefully be atomic (only the principal functor name matters) and in C-Prolog the key cannot be an integer (nor is record/3 that fast in C-Prolog). At present all you've really got is the clause indexing mechanism, so you're stuck with assert/retract and all that that implies if you want to cut down on searching through big data structures such as long lists. Gimme arrays, gimme hunks, gimme a decent record package, gimme hash tables, gimme something!! Prolog-2 does let you specify hash tables for particular predicates, but these are only used by the calling mechanism. Wouldn't it be nice to have some way of specifying that anything with a given principal functor was to be accessed through a set of hash tables, sub-indexing specifiable by the user in some way, e.g. akin to that offered by PEARL for frame access? Some other Prologs (e.g. Arity Prolog) do now offer such things as B-trees and hash tables for term storage. At the moment, for simplicity, the chart is only represented as two lists - one of active edges, one of inactive edges. Since the chart only grows, never shrinks, it might be better to represent it as two ordered trees peppered with holes (variables). This ought to speed things up a lot. I considered the idea of not allowing Prolog variables in the terms which represent categories; this would end all temptation to misuse unification. Instead, I have in the past used terms of the form ?Atom as variables (where ?/1 is defined as a prefix operator, none of this messy atom exploding done by LISPers who don't/can't alter their character tables). This leads to passing binding lists about a lot. As far as the ersatz unification in edge generation goes, there is no significant speed difference between using Prolog variables and these pseudo-variables. However, there is a difference in speed when looking for candidate edge pairs, since with Prolog variables I can just use the dodge of specifying \+(\+(Category1=Category2)), which checks whether they will unify, without doing so, in a reasonably efficient way. With pseudo-variables I'd need to indulge in term smashing or suffer the overhead of carrying binding lists about in the chart. The latter is not so awful as it might look, since it would be possible to construct the unifier at the same time as checking candidacy, and just carry the unifier about till needed. But, this still looks a bit worse than using Prolog variables.